Word and acoustic confidence annotation for large vocabulary speech recognition

Author

  • Lin Lawrence Chase
Abstract

We present improvements in confidence annotation of automatic speech recognizer output for large vocabulary, speaker-independent systems. Several strong additions to the set of predictor variables used for this purpose are discussed. Extensions which allow prediction of separate types of errors, as opposed to the simple presence of an error, are presented. A new development, acoustic confidence annotation, is explored, in which a predictor is built that indicates the likely successes and failures of the acoustic models alone. Four separate learning mechanisms are compared in terms of their ability to provide good confidence annotations from the same set of predictor variables. Performance figures are reported on both read news (the North American Business news corpus) and conversational telephone speech (the Switchboard corpus), both in American English. The Sphinx-II system [1] is used for the NAB tests. The Janus system [2] is used for the Switchboard tests.

1. Annotation of Read Speech

This section describes a confidence annotation system for the Sphinx-II recognizer. The data used is the speaker-independent portion of the NAB corpus. The development test set, used for training the confidence annotator, contains 1523 utterances. The evaluation test set contains 1162 utterances. We annotate hypothesized words with probabilities of membership in three classes:

1. correct (P(C)): the correct word is guessed within 2 frames of its location in the forced alignment of the reference transcript, or
2. incorrect/oov (P(OOV)): the word was guessed incorrectly because it "covered" a spoken word that was not in the recognizer's dictionary, or
3. incorrect/other (P(Other)): the wrong word was guessed or the right word was guessed with an incorrect segmentation.
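The three target classes can be made concrete with a small labeling sketch. The function below is illustrative only: the input formats and the exact rule for deciding that a hypothesis "covers" an out-of-vocabulary reference word are assumptions, not the paper's procedure; only the 2-frame tolerance on the forced-alignment location comes from the text.

```python
def label_hyp_word(word, start, end, ref_align, dictionary, tol=2):
    """Assign one of the three annotation target classes to a hypothesized word.

    ref_align:  [(ref_word, ref_start, ref_end)] frame spans from the forced
                alignment of the reference transcript.
    dictionary: set of words in the recognizer's dictionary.
    Sketch only: the overlap test for "covering" an OOV word is an assumption.
    """
    # Correct: same word, within `tol` frames of its forced-alignment location.
    for w, s, e in ref_align:
        if w == word and abs(s - start) <= tol:
            return "correct"                    # contributes to P(C)
    # OOV: the hypothesis span overlaps a reference word the recognizer
    # could not have produced.
    for w, s, e in ref_align:
        overlap = min(end, e) - max(start, s)
        if overlap > 0 and w not in dictionary:
            return "incorrect/oov"              # contributes to P(OOV)
    # Otherwise: wrong word, or right word with wrong segmentation.
    return "incorrect/other"                    # contributes to P(Other)
```

Labels produced this way serve as training targets for the confidence annotator.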
The predictor variables are constructed from:

  • The contents of an N-best list that contains complete word and phone segmentation and score information for each of 150 elements, including the best hypothesis,
  • Language model score and calculation-source information from the same best-scoring hypothesis,
  • Word pronunciations and their frequencies in the acoustic training materials,
  • The results of a parallel "phone-only" decoding, in which the recognizer is constrained neither to phone sequences from the dictionary nor to word sequences in the language model in its recognition of phone sequences,
  • The results of a completely unconstrained frame-by-frame decoding, in which the best possible acoustic score and basephone are determined for each frame of the utterance, and
  • Three distance metrics between basephones, including a simple match count at the frame level, a phonologically-based similarity measure (HWC), and an empirically derived confusion-based distance measure.

After constructing predictor variables from these sources we compare various combinations in their ability to assign probabilities of error-class membership to words hypothesized by the recognizer. The ability to successfully predict error classes is measured with the reduction-in-cross-entropy measure described in [3]. Useful groups of predictor variables include:

1. percPhAll: The percentage of frames in the hypothesized word whose basephones match the basephones in the phone-only decoding.
2. Combined+Duration: The combined acoustic and language model score, together with the duration (in frames) of the word. (This group is an approximation to the information typically available to the recognizer during its normal decoding passes.)
3. Nbest: The N-best homogeneity score, which is a measure of the weighted ratio of all paths which pass through the hypothesized word, as represented in the N-best list [4] [5] [6] [7].
4. Nbest+AvgWF: The N-best homogeneity score, together with the frame-averaged number of words present in the N-best list at the hypothesized location of the word in question.
5. LMscore: Language model score only.
6. LMscore+LMsource: Language model score, together with the case/branch of its origin in the backoff algorithm which calculates the score.
7. avgHPAll: The average value of the HWC metric of the hypothesized word's phones with respect to the phone-only decoding at the same point in the utterance.
8. percPhAll+avgHPAll: The combined effects of the simple match and HWC metrics.
9. ConfBest+avgSenR+avgHPBest+acNormS: A mix of several acoustic predictors that use the best senone at each frame as a reference or normalization.
10. TrainWC: A count of the number of times the word was seen in the acoustic training data.
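The N-best homogeneity score of group 3 can be sketched as the posterior-weighted fraction of N-best hypotheses that contain the hypothesized word at roughly the same frame position. The function below is a simplified sketch under that reading; the hypothesis-list format, the score normalization, and the 50%-overlap matching rule are illustrative assumptions, not the exact formulation of [4-7].

```python
import math

def nbest_homogeneity(hypotheses, word, start, end, min_overlap=0.5):
    """Weighted ratio of N-best paths containing `word` near frames [start, end).

    hypotheses: list of (total_log_score, [(w, w_start, w_end), ...]) pairs.
    Sketch: weights are pseudo-posteriors from exponentiating total log scores,
    shifted by the max for numerical stability.
    """
    if not hypotheses:
        return 0.0
    scores = [s for s, _ in hypotheses]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)

    def contains(segmentation):
        # Word matches if it overlaps more than `min_overlap` of the span.
        for w, ws, we in segmentation:
            if w == word and min(end, we) - max(start, ws) > min_overlap * (end - start):
                return True
        return False

    mass = sum(wt for wt, (_, seg) in zip(weights, hypotheses) if contains(seg))
    return mass / total
```

A word that appears with consistent segmentation in most high-scoring hypotheses scores near 1; a word unique to a single low-ranked hypothesis scores near 0.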


Related articles

Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting

Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval within this archive, is one of the key issues for this broadcasting...


Confidence measures for large vocabulary continuous speech recognition

In this paper, we present several confidence measures for large vocabulary continuous speech recognition. We propose to estimate the confidence of a hypothesized word directly as its posterior probability, given all acoustic observations of the utterance. These probabilities are computed on word graphs using a forward–backward algorithm. We also study the estimation of posterior probabilities o...
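The word-graph posterior approach summarized above amounts to a forward-backward pass over a lattice of word hypotheses. The sketch below is a generic illustration of that idea, not the paper's implementation: the lattice format (topologically ordered nodes, edges carrying combined log scores) and the single start/final node are assumptions.

```python
import math

NEG_INF = float("-inf")

def logadd(a, b):
    """log(e^a + e^b), safe when either operand is -inf."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def edge_posteriors(nodes, edges, start, final):
    """Forward-backward over a word lattice (DAG).

    nodes: node ids in topological order.
    edges: list of (src, dst, word, log_score); scores are assumed to
           already combine acoustic and language model terms.
    Returns one posterior probability per edge, in input order.
    """
    fwd = {n: NEG_INF for n in nodes}
    bwd = {n: NEG_INF for n in nodes}
    fwd[start] = 0.0
    bwd[final] = 0.0
    for n in nodes:                       # forward pass
        for s, d, _, sc in edges:
            if s == n:
                fwd[d] = logadd(fwd[d], fwd[n] + sc)
    for n in reversed(nodes):             # backward pass
        for s, d, _, sc in edges:
            if d == n:
                bwd[s] = logadd(bwd[s], bwd[n] + sc)
    total = fwd[final]                    # log of total path mass
    return [math.exp(fwd[s] + sc + bwd[d] - total) for s, d, _, sc in edges]
```

A word-level confidence would then sum edge posteriors over edges carrying the same word at overlapping times.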


Confidence measures for hybrid HMM/ANN speech recognition

In this paper we introduce four acoustic confidence measures which are derived from the output of a hybrid HMM/ANN large vocabulary continuous speech recognition system. These confidence measures, based on local posterior probability estimates computed by an ANN, are evaluated at both phone and word levels, using the North American Business News corpus.


Acoustic confidence measures for segmenting broadcast news

In this paper we define an acoustic confidence measure based on the estimates of local posterior probabilities produced by a HMM/ANN large vocabulary continuous speech recognition system. We use this measure to segment continuous audio into regions where it is and is not appropriate to expend recognition effort. The segmentation is computationally inexpensive and provides reductions in both ove...


Automatic speech recognition using acoustic confidence conditioned language models

A modified decoding algorithm for automatic speech recognition (ASR) will be described which facilitates a closer coupling between the acoustic and language modeling components of a speech recognition system. This closer coupling is obtained by extracting word-level measures of acoustic confidence during decoding, and making coded representations of these confidence measures available to the ASR n...


Acoustic and Word Lattice Based Algor

Word confidence scores are crucial for unsupervised learning in automatic speech recognition. In the last decade there has been a flourish of work on two fundamentally different approaches to computing confidence scores. The first paradigm is acoustic and the second is based on word lattices. The first approach is data-intensive and requires explicitly modeling the acoustic channel. The second ...




Publication date: 1997